Introduction to RBiotools

RBiotools is a package for COMPARATIVE MICROBIAL GENOMICS. RBiotools works best in the RStudio integrated development environment. RBiotools works with IDs from GenBank, the NIH genetic sequence data base, to perform analyses and create figures.

Before installation,

Install the RBiotools package and other required packages

1. install and update dependencies

  #Install some basic dependencies that are required for RBiotools
  install.packages(c("rlang","ape", "fmsb", "gplots", "grImport", "gridExtra", "msa", "pheatmap", "RCurl", "rentrez", "seqinr"), repos =" http://cran.us.r-project.org ")
  install.packages(c("lattice","mgcv","nlme","survival"), repos =" http://cran.us.r-project.org ")

  
  #Install msa and Biostrings package via BiocManager (and install BiocManager if it is not already installed)
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager", repos =" http://cran.us.r-project.org ")

  BiocManager::install("msa")
  BiocManager::install("Biostrings", force = TRUE)

  #Install installr - needed to load RBiotools from tar.gz
  install.packages("installr", repos =" http://cran.us.r-project.org ")
  ## this library call is required so the dendrogram will work
  library(msa) 
  
  
  ## this library call is required so we can install RBiotools from the zip source
  library(installr) 

2. install the RBiotools package from zip source; and load the RBiotools library

  install.packages("C:/dc3/RBiotools_0.5.0.tar.gz", repos = NULL, type = "source")
  library(RBiotools) 

3. Setup the environment and working directory

Choose a directory / folder that will be used for some of the figures, including Heat Maps and Genome Atlases:

  • Create this directory on your computer, if it does not already exist
  • Make this directory the “Working Directory”
ifelse(!dir.exists(".//r_repository//"), dir.create(".//r_repository//"), "Folder exists already")

setwd("C:/dc3/R_repository") #setting the working directory

Introduction to RBiotools

The first step is to choose a set of organisms that you would like to explore.

What is an organism?

  • Organisms are specified with GenBank identifiers (accession IDs?)
  • Organisms NOT in Genbank can be added with the RBiotools addGenome function

Let’s use a sample sets of organisms to get started. Specifically, a list of E. coli and Shigella organisms. This set will be stored using the name eColi.

eColi <- c(
  "AP012306",  # Escherichia coli str. K-12 substr. MDS42 DNA         3.976 Mb - smallest chromosome
  "KK583188",  # Escherichia coli DSM 30083 = JCM 1649 = ATCC 11775   4.907 Mb - type strain scaffold
  "U00096",    # Escherichia coli str. K-12 substr. MG1655            4.642 Mb - first E. coli genome sequenced
  "CP000802",  # Escherichia coli HS                                  4.644 Mb - group A representative, commensal
  "CP000800",  # Escherichia coli E24377A                             4.980 Mb - group B1 representative
  "AP009378",  # Escherichia coli SE15                                4.717 Mb - group B2 representative, commensal
  "FM180568",  # Escherichia coli 0127:H6 E2348/69                    4.966 Mb - group B2 representative, enteropathogenic
  "CU928163",  # Escherichia coli UMN026                              5.202 Mb - group D representative
  "CP008957",  # Escherichia coli O157:H7 str. EDL933                 5.547 Mp - group E representative
  "CP027027",  # Shigella dysenteriae strain E670/74                  5.037 Mb - Shigella dysenteria representative
  "CP026802",  # Shigella sonnei strain ATCC 29930                    4.975 Mb - Shigella sonnei representative
  "CP026877",  # Shigella boydii strain ATCC 35964                    5.129 Mb - Shigella boydii representative
  "CP026793",  # Shigella flexneri strain 74-1170                     4.734 Mb - Shigella flexneri representative
  "CP015831"   # Escherichia coli O157 strain 644-PT8                 5.831 Mb - largest chromosome
)

and here we have a list of organisms that are not as closely related from the Proteobacteria classes. This set will be stored using the name proteobacteria.

proteobacteria <- c(
  "CP018228", # Rhizobium leguminosarum strain Vaf-108              Phylum: Proteobacteria (alpha)
  "CP017405", # Bordetella pertussis strain J448                    Phylum: Proteobacteria (beta)
  "CP008957", # Escherichia coli O157:H7 str. EDL933                Phylum: Proteobacteria (gamma)
  "HG530068", # Pseudomonas aeruginosa PA38182                      Phylum: Proteobacteria (gamma)
  "CP002031", # Geobacter sulfurreducens KN400                      Phylum: Proteobacteria (delta)
  "CP002332"  # Helicobacter pylori Gambia94/24                     Phylum: Proteobacteria (epsilon)
)

Explore functions that make plots and figures

Plot a dendrogram using 16S rRNA genes

  #This first plot is for the eColi and Shigella organisms
  dendrogram16S(eColi)
## Initializing RBiotools
## Downloading genome data for organism with accession ID: AP012306 
## Downloading genome data for organism with accession ID: KK583188 
## Downloading genome data for organism with accession ID: U00096 
## Downloading genome data for organism with accession ID: CP000802 
## Downloading genome data for organism with accession ID: CP000800 
## Downloading genome data for organism with accession ID: AP009378 
## Downloading genome data for organism with accession ID: FM180568 
## Downloading genome data for organism with accession ID: CU928163 
## Downloading genome data for organism with accession ID: CP008957 
## Downloading genome data for organism with accession ID: CP027027 
## Downloading genome data for organism with accession ID: CP026802 
## Downloading genome data for organism with accession ID: CP026877 
## Downloading genome data for organism with accession ID: CP026793 
## Downloading genome data for organism with accession ID: CP015831 
## Calling type 'SSU' rRNA genes for genome with accession ID: AP012306
## Calling type 'SSU' rRNA genes for genome with accession ID: KK583188
## Calling type 'SSU' rRNA genes for genome with accession ID: U00096
## Calling type 'SSU' rRNA genes for genome with accession ID: CP000802
## Calling type 'SSU' rRNA genes for genome with accession ID: CP000800
## Calling type 'SSU' rRNA genes for genome with accession ID: AP009378
## Calling type 'SSU' rRNA genes for genome with accession ID: FM180568
## Calling type 'SSU' rRNA genes for genome with accession ID: CU928163
## Calling type 'SSU' rRNA genes for genome with accession ID: CP008957
## Calling type 'SSU' rRNA genes for genome with accession ID: CP027027
## Calling type 'SSU' rRNA genes for genome with accession ID: CP026802
## Calling type 'SSU' rRNA genes for genome with accession ID: CP026877
## Calling type 'SSU' rRNA genes for genome with accession ID: CP026793
## Calling type 'SSU' rRNA genes for genome with accession ID: CP015831
## Creating dendrogram for 14 16S rRNA sequences
## use default substitution matrix

  #The second plot is for the proteobacteria organisms
  dendrogram16S(proteobacteria)
## Downloading genome data for organism with accession ID: CP018228 
## Downloading genome data for organism with accession ID: CP017405 
## Downloading genome data for organism with accession ID: HG530068 
## Downloading genome data for organism with accession ID: CP002031 
## Downloading genome data for organism with accession ID: CP002332 
## Calling type 'SSU' rRNA genes for genome with accession ID: CP018228
## Calling type 'SSU' rRNA genes for genome with accession ID: CP017405
## Calling type 'SSU' rRNA genes for genome with accession ID: HG530068
## Calling type 'SSU' rRNA genes for genome with accession ID: CP002031
## Calling type 'SSU' rRNA genes for genome with accession ID: CP002332
## Creating dendrogram for 6 16S rRNA sequences
## use default substitution matrix

Create a Genome Atlas for a genome

  createAtlas(eColi[1]) # here we index into the list and grab the first item; AP012306 or E.coli str. K-12 substr. MDS42 DNA

Genome Atlas for eColi group

Note: To view this Genome Atlas, which is stored as an SVG file

*   Open a browser ... Safari, Chrome, Firefox, ...
*   Click "File" in the menu bar and choose "Open File ..." from the pull-down menu
*   Navigate to your Working Directory
*   Double-click on "GenomeAtlas_AP012306.svg"

Plot amino acid usage

Note: The plotUsage function has many options. See the plotUsage documentation.

  plotUsage(eColi[1])
## ----------
## Predicting protein genes for genome with accession ID: AP012306 
## Request: Single Genome, Phase: Training
## Reading in the sequence(s) to train ... 3976195 bp sequence created, 51.14% GC
## Looking for GC bias in different frames ... frame bias scores: 1.56 0.182 1.25 
## Building initial set of genes to train from ...
##  done!
## Creating coding model and scoring nodes ... done!
## Examining upstream regions and training starts ...
##  done!
## Request: Single Genome, Phase: Gene Finding
## Finding genes in sequence # 1 (3976195 bp) ... 
##  done!  3621 genes predicted
## Plotting AA usage across 3621 proteins

Create a codon heat map

Note: Heat Maps work best for groups of organisms that are NOT closely related (like our Proteobacteria group?)

  plotHeatMapCodon(proteobacteria)
## Calling protein genes for accession ID: CP018228 
## Calling protein genes for accession ID: CP017405 
## Calling protein genes for accession ID: CP008957 
## Calling protein genes for accession ID: HG530068 
## Calling protein genes for accession ID: CP002031 
## Calling protein genes for accession ID: CP002332

Heat Map Codon for Proteobacteria group

Note: The heat map is now in your Working Directory.

 *    Click on the "Files" tab in the lower left quadrant
 *    Navigate to your Working Directory (hint: you may need to click on the name of your Working Directory to update it)
 *    Click on the "HeatMapCodon.png" file

Plot a Blast Matrix for a set of organisms

Note: Blast Matrices work best for groups of closely related organisms (like our eColi group?)

  1. Build a “homology matrix”, a table where
  • each table row is a protein group
  • each table column is an organism
  • each table entry is the number of an organism’s proteins in a group
  proteinGrouping <- runLinclust(eColi)
## Calling protein genes for accession ID: KK583188 
## Calling protein genes for accession ID: U00096 
## Calling protein genes for accession ID: CP000802 
## Calling protein genes for accession ID: CP000800 
## Calling protein genes for accession ID: AP009378 
## Calling protein genes for accession ID: FM180568 
## Calling protein genes for accession ID: CU928163 
## Calling protein genes for accession ID: CP027027 
## Calling protein genes for accession ID: CP026802 
## Calling protein genes for accession ID: CP026877 
## Calling protein genes for accession ID: CP026793 
## Calling protein genes for accession ID: CP015831 
## Number of ungapped alignments: 112023
## Number of Smith-Waterman alignments: 2514
  1. Use the homology matrix to plot the Blast Matrix
  plotBlastMatrix(proteinGrouping)

Blast MAtrix for eColi group

Note: To view this Homology Matrix, which is stored as an SVG file

*   Open a browser ... Safari, Chrome, Firefox, ...
*   Click "File" in the menu bar and choose "Open File ..." from the pull-down menu
*   Navigate to your Working Directory
*   Double-click on "BlastMatrix.svg"